National Addresses

Background

The United States is one giant network of roads connecting residential and commercial properties. Millions of people every day drive from one location to another. A to B. Today very little thought goes into the actual mapping of this route, as A and B are searchable through network-connected applications with significant algorithmic backends capable of finding the fastest route from A to B in milliseconds. A and B no longer need to be easily searchable on a map or organized in an intuitive numbering system. Without these constraints, how have we organized the world around us through addresses? Are there patterns or hidden stories in the way we name where we live and work?

Data is compiled from the National Address Database which currently contains over 78 million entries across more than 36 states. Unfortunately, not all data in select regions is publicly tabulated or available. Please see the map produced by the USDOT which details which counties have provided data for this dataset. I encourage you to read through the database site; however if you don't, I think they concisely summarize exactly why such a database is important to maintain:

"Accurate and up-to-date addresses are critical to transportation safety and are a vital part of Next Generation 9-1-1. They are also essential for a broad range of government services, including mail delivery, permitting, and school siting..."

Data Processing

The first hurdle we encounter is that the csv (comma separated values) database download of choice is nearly 30 GB. 30 GB of text is a monstrous amount of text. Loading and manipulating this amount of data without specialized processes or intentional system design will be very slow. Safe to say the original file will not be contained within the project's regular GitHub repository.
Before plotting anything, we should remove excess data that is not informative to us. The dataset contains 60 columns of data for all 78,078,347 individual addresses. This includes some duplicate data like the "StNam_Full" column which contains all of the data from eight additional columns. For instance, East 10th Street is listed as StNam_Full, but it has St_PreDir of "East", St_Name of "10th", and St_PosTyp of "Street".
After this processing is complete, the new file size is only 6.1 GB. Still very large, but a lot more manageable. As a second cleaning step, any data entry with fewer than 60 segments of information was immediately excluded. This cut our number of entries from 78 million to 56 million, so more than 1/4 of the data provided (roughly 28%) may have been formatted incorrectly. Alternate dataset cleaning techniques could be used to attempt to map these rows to assumed columns, but given that the dataset already contains only a subset of all addresses, excluding a few more which were not properly formatted should not significantly change trends.
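One way to do this kind of trimming without loading the whole file into memory is to stream it row by row. The sketch below is an illustration under assumptions, not the script used for this project: it relies only on Python's standard csv module, the function names are made up, and the list of columns to keep is hypothetical (the text does not say which columns were retained).

```python
import csv

# Hypothetical selection; the columns actually kept are not listed in the text.
KEEP = ["StNam_Full", "St_PreDir", "St_Name", "St_PosTyp"]


def slim_rows(lines, keep=KEEP, expected_fields=60):
    """Yield the header and rows reduced to `keep`, one row at a time.

    Rows with fewer than `expected_fields` fields are skipped, mirroring
    the "fewer than 60 segments" rule described above.
    """
    reader = csv.reader(lines)
    header = next(reader)
    positions = [header.index(name) for name in keep]
    yield list(keep)
    for row in reader:
        if len(row) < expected_fields:
            continue  # malformed or truncated row
        yield [row[i] for i in positions]


def slim_file(src_path, dst_path, keep=KEEP):
    """Stream src_path to dst_path so memory use stays flat."""
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        csv.writer(dst).writerows(slim_rows(src, keep))
```

Because `slim_rows` is a generator, only one row is held in memory at a time, which matters far more than raw speed when the input is around 30 GB.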


Street Names

The 56 million or so addresses in the dataset are described by 1,133,644 unique street names. We can look at the most common street names by total addresses (Figure A) as well as the most common street names by the number of distinct counties that have a street with this name (Figure B). There are 1,592 counties in this list for reference - so Main Street is present in almost 2/3 of all counties in the NAD. Some other quick descriptive facts about street names in this database:

  • The number of addresses per county on average is 35,003.
  • The median is only 11,964 with the fewest being 1 and the most with 1,877,615.
  • The number of distinct addresses per county road on average is 22.05
  • The median is only 8 with the fewest being 1 and the most with 10,613. In total there are 2,527,200 county-roads*

  • * County-roads are defined such that one road can exist in only one county. E.g., if Elm Street passes between two counties, it is counted once as Elm Street in county A and once in county B.
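The county-road definition and the two rankings (Figure A by total addresses, Figure B by distinct counties) can be sketched as a pair of counting passes. This is a minimal illustration using the standard library; the tuple layout, function names, and sample rows are assumptions rather than the project's actual code.

```python
from collections import Counter
from statistics import mean, median


def county_road_stats(records):
    """Summarize addresses per county-road.

    `records` holds one (state, county, street_name) tuple per address.
    """
    per_road = Counter(records)          # key = one county-road
    counts = list(per_road.values())     # addresses on each county-road
    return {
        "county_roads": len(counts),
        "mean": mean(counts),
        "median": median(counts),
        "fewest": min(counts),
        "most": max(counts),
    }


def rank_names(records, n=3):
    """Return top names by total addresses and by distinct counties."""
    records = list(records)
    by_addresses = Counter(name for _, _, name in records)
    # set() collapses each county-road to one entry before counting names.
    by_counties = Counter(name for _, _, name in set(records))
    return by_addresses.most_common(n), by_counties.most_common(n)


# Made-up sample: Elm Street crosses a county line, so it forms two county-roads.
sample = (
    [("OH", "Franklin", "Elm Street")] * 3
    + [("OH", "Delaware", "Elm Street")]
    + [("OH", "Franklin", "High Street")] * 2
)
stats = county_road_stats(sample)
top_by_addresses, top_by_counties = rank_names(sample, n=1)
```

As a sanity check on the figures above, 2,527,200 county-roads at 22.05 addresses each is about 55.7 million, and 1,592 counties at 35,003 addresses each gives the same total, consistent with the roughly 56 million addresses in the cleaned dataset.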

    Looking through these two figures, a number of trends stand out to me. For one, most of the names can be categorized into just a few categories - see below for these categories (only Figure A included). Streets (S) and Avenues (A) are abbreviated in the following. (S/A) indicates both street and avenue versions are present in the top indices.

    Grid/Ordinal
    • Main S
    • Central Avenue
    • Center S
    • North S
    • (1, 2, 3, 4, 5, 6, 7) S
    • South S
    • West S
    • North 7th S
    • West (2, 3, 4, 5) S
    • East (2, 3, 4) S
    • 5th A
    Historical
    • (E/W/NA) Washington S/A
    • Lincoln S/A
    • Madison S/A
    • Franklin S
    • Jefferson S/A
    • Jackson S
    • Adams S
    Arboreal
    • Elm S
    • Oak S
    • Maple S/A
    • Walnut S
    • Pine S
    • Chestnut S
    • Cedar S
    • Forest A
    Geographic
    • Park S/A
    • River Road
    • Ridge Road
    • Riverside Drive
    • Atlantic A
    • Lake S
    • Lakeview Drive
    • Meadow Lane
    Services
    • Airport Road
    • Church S
    • Mill S
    • Market S
    Other
    • High S
    • Broadway
    • State S
    • Grand A
    • Highland S
    • Spring S
    • Sunset Drive
    • Broad S
    • Broadway S
    • Union S
    • Circle Drive
    • Front S
    • Locust S
    • Prospect S

    Unsurprisingly, the most common name for both metrics is Main Street. For total number of addresses, there are nearly 400,000 addresses in the NAD listed on some variant of Main Street. Intuitively, main streets are common thoroughfares with residential and business buildings that are often multi-unit and stretch across wide urban areas. Some other standouts are just how many presidents and trees are represented in our naming system. Furthermore, Streets are far more common in both top 75 lists than Roads, Avenues, Drives, and others combined.

    Comparing the two subplots, the first thing that stands out is the relative re-ordering of some of the "big" names. For instance, almost all of the traditionally urban Avenues are no longer present in Figure B. Instead there are 6 more Drives and 3 more Roads and others. We also have a number of street names that rose through the rankings and were not present on the first subplot but are now exceptionally common across counties: Cherry Street (21st), Railroad Street (33rd), Hillcrest Drive (34th), and more. The biggest human geography difference between these two subplots is that the first generally prioritizes urban routes lined with multi-unit buildings whereas the second is population agnostic and has greater proportional representation from rural counties. In these counties, geographic and arboreal naming schemes appear to be more prevalent than some of the traditionally urban names which might be historical or ordinal.

    Let's now take a look at the most populous single streets in this database. To make this distinction, we will sum every address per single road in each county. The results are shown in Figure C. Some of these results are relatively surprising to me. I would have assumed the figure would be dominated by Manhattan streets, as the tall buildings would contain a large number of addresses. However, it appears as though no Manhattan streets are in the top 25 (the NY Main Street is a very long thoroughfare in Buffalo, not NYC). Instead of being NY-dominated, this is a good representation of some of the longest continuous roads in major urban centers.

    Places 1, 3, 7, and 17 are streets that cut through almost the entirety of Washington DC and are lined by large multi-unit buildings. 14/25 are located in Maricopa County, AZ (Phoenix Metro Area). Phoenix is notorious for its grid layout and these major roads run uninterrupted through an area home to over 5 million residents. High Street in OH is the principal road of Columbus, as is Congress Street in Portland, Maine. Atlantic Avenue in Virginia Beach is 3 miles of condominiums, and Old Hickory Avenue encircles Nashville, TN. It even has its own Wikipedia page. Rfd in Lake County, IL is a little bit of a mystery to me. It could be a transcription or data entry error, or it could be actual addresses designated as a byproduct of an old USPS service: Rural Free Delivery.

    Another fun gimmick we can look at is the length of street names - mainly the longest ones. The first thing I discovered is that Ohio has a tendency to be obnoxious... After a little additional processing to remove Interstates, Highways, and "United States Routes", Figure D is the result. Had we kept some of these keywords, the figure would be dominated by Ohio's circumlocutory interstates and underpasses like "Interstate 270 South And United States Route 33 Underpass". In fact, out of the 250 "true" longest street names in the dataset, 187 are in Franklin County, OH, and 211 are in OH (~85%).
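The keyword filtering described here might look something like the sketch below. The keyword list comes from the paragraph above, but the case-insensitive substring match, the function name, and the tie-breaking rule are assumptions made for illustration.

```python
# Keywords taken from the paragraph above; the exact matching rule is an assumption.
EXCLUDE = ("interstate", "highway", "united states route")


def longest_names(names, n=25, exclude=EXCLUDE):
    """Return the n longest distinct names that contain none of the keywords."""
    kept = {
        name for name in names
        if not any(keyword in name.lower() for keyword in exclude)
    }
    # Longest first; ties broken alphabetically so the result is stable.
    return sorted(kept, key=lambda name: (-len(name), name))[:n]


names = [
    "Interstate 270 South And United States Route 33 Underpass",
    "East Morgan Ave - East Of Taco Bell Avenue",
    "Main Street",
    "Main Street",
]
top = longest_names(names, n=2)
```

Note that a name containing only the bare word "Route" is not excluded by this list, so entries like "Route 695 Southbound To Route 695 Northbound" would still survive the filter.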

    There aren't too many trends in this list, but a couple might be Reverends & Redundancy. 1/5th of these names include some form of Dr. Martin Luther King Jr, and another is named after Pastor Jerome Jackson Senior. Buttermilk Falls Primitive Walking Trl Trail, North/South Black Canyon Access Road Access Road, and Panther Mountain Drive Extension Extension are all overtly redundant by using their street suffix twice. Route 695 Southbound To Route 695 Northbound is an egregious way to say "Route 695".
    My personal favorite is East Morgan Ave - East Of Taco Bell Avenue.


    Street Suffixes

    One curiosity gained while looking at the most common street names in this data set is the decision regarding street suffix, e.g., Street, Road, Lane, Boulevard, etc. Why is Dogwood Lane more common than Dogwood Drive, which is more common than Dogwood Street or Avenue? How are these decided? For many I'm not sure if we will ever have a definitive answer. However, I have a couple of plausible theories. I speculate that certain road names become more common and that contributes to a snowball effect, where a name sounds better in the way it has been written before rather than deviating and creating something new. Alternatively, the same underlying subconscious around naming might be consistent across time and geography, contributing to independent "namers" converging on the same name. For example: River Road feels more natural than River Street or River Boulevard. There are pieces online that give more historical and utilitarian meaning for each suffix: Steamboat Agent & Visual Guide to Roads. Summarizing, these sources describe our target suffixes as:

  • Road: general classification, or any route connecting two points
  • Street: usually urban, often E-W, has buildings on both sides, perpendicular to avenues
  • Drive: contoured, winding road shaped by natural environments, rooted in history of cattle drives
  • Avenue: usually urban, often N-S, may have trees on one side, perpendicular to streets
  • Lane: narrow, often rural
  • Court: ends in a circle, cul de sac, or loop
  • Circle: encircles an area
    Based on these resources, it seems that different regions may have different suffix prevalences depending on original or current land usage. The following plot shows the 8 most common suffixes (by counties containing at least 1 road with that suffix) across the NAD geographic region. Please use the button in the top left to select between suffixes and scroll or move around as needed. Heatmap data represents the decimal proportion of roads within that county which use that suffix. For example, St Louis County, MN (very North) has 4901 registered roads in the database, and roughly 40% of them are terminated by "Road". The color scale saturates at 50%: any value above 50% is drawn in the same shade of blue. Some counties are either very red or very blue - often because they only have 1 registered road in this database and that road either matches the selected suffix (1) or doesn't (0).
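The per-county proportion behind such a heatmap, along with a color scale that stops changing at 50%, can be sketched as follows. The input layout, function names, and sample counts are made up for illustration; this is not the plotting code behind the figure.

```python
from collections import Counter


def suffix_shares(roads):
    """Map (county, suffix) to the fraction of that county's roads using the suffix.

    `roads` holds one (county, suffix) pair per county-road.
    """
    roads = list(roads)
    totals = Counter(county for county, _ in roads)
    pairs = Counter(roads)
    return {(county, suffix): n / totals[county]
            for (county, suffix), n in pairs.items()}


def color_value(share, cap=0.5):
    """Clamp a share so everything above `cap` maps to the same shade."""
    return min(share, cap)


# Made-up counts, not taken from the database.
roads = [("County A", "Road")] * 2 + [("County A", "Street")] * 3 + [("County B", "Lane")]
shares = suffix_shares(roads)
```

In this toy input, County B has a single road, so its share is 1.0 for "Lane" and absent (effectively 0) for every other suffix, which is the all-or-nothing coloring described above.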

    Let's look through each of these to see how they match up with the previously written descriptions.

  • Road: Widespread across the region, very abundant in MT, NY, TN, AR, ME, VT. Seems to be a catch-all.
  • Street: Exceptionally abundant in IA, KS, NE, ND, and WI. Not as urban as suggested. However, still quite common in the NYC counties & DC.
  • Drive: Consistent ~15-25% usage throughout Appalachia and AZ. Appalachia makes sense given the winding hills of the region.
  • Avenue: IA & ND love avenues to go with their streets. Very common in the Bronx as well as a few other scattered counties.
  • Lane: Common in Western MT, and TN, NC, VA, and AR. Likely states to have narrow winding rural roads.
  • Court: Common around suburban metro areas: Chicago/Milwaukee, Raleigh, Nashville, Minneapolis, and Richmond.
  • Circle: Limited usage outside of Elbert CO, and Rich UT. Indicates rare but consistent usage across the counties depicted
    Overall, most of these suffix descriptions seemed to match their geographic distributions, besides the abundance of streets and avenues in some of the farming states like ND and IA. Undefined covers all the unlisted, undocumented, or absent suffixes. For instance, Broadway has no suffix. Included in the GitHub repository is a separate dataframe containing this data without all 55 million rows if you want to look through it yourself. Every row represents one suffix-county pairing and the count column indicates the number of routes in this county which contain this suffix. 274 suffixes are listed - I encourage you to look for patterns in the ~266 not mentioned. Some notable observations from skimming through the csv:

  • Alcove is only used in 2 counties, both in MN but is present in 104 streets in those two counties. Cove is vastly more common.
  • Circulo is used once in this dataset in Mohave, AZ as the Spanish variant of Circle. Entrada is also used only once... in Pottawattamie, IA.
  • Garth is another regional name - only present in 16 routes in MD. A garth is "an enclosed quadrangle or yard, especially one surrounded by a cloister."
  • Stravenue is used 38 times in NM and AZ with 35/38 of those routes in Pima, AZ. This is quite the cop-out of a suffix if you ask me.
  • Throughway is spelled as such unless you are in East Baton Rouge Parish, LA, in which case it is Thruway.
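Observations like these come from rolling the suffix-county table up to one line per suffix. Here is a minimal sketch of that roll-up, assuming a csv with columns named `suffix`, `state`, `county`, and `count`. Only the count column is described in the text; the other column names and the sample rows are hypothetical.

```python
import csv
from collections import defaultdict


def summarize_suffixes(lines):
    """Return {suffix: (number_of_counties, total_routes)} from suffix-county rows."""
    counties = defaultdict(set)
    routes = defaultdict(int)
    for row in csv.DictReader(lines):
        counties[row["suffix"]].add((row["state"], row["county"]))
        routes[row["suffix"]] += int(row["count"])
    return {suffix: (len(counties[suffix]), routes[suffix]) for suffix in counties}


# Made-up rows in the assumed layout.
sample = [
    "suffix,state,county,count",
    "Cove,XX,County A,40",
    "Cove,XX,County B,25",
    "Alcove,XX,County A,7",
]
summary = summarize_suffixes(sample)
```

Sorting the result by the first element of each tuple surfaces the regional oddities (suffixes present in only one or two counties), while sorting by the second highlights suffixes that are rare geographically but heavily used where they do appear.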

    Numbering

    Street names are only one portion of our overall address, so let's now look at the 110,737 unique house numbers. Are there any trends you might be able to predict before looking at figures?

    In Figure E we can see the occurrence of house numbers 1-99. Looking at this data, a few observations stand out:

  • 10 is our favorite address, with general preference decreasing up to 99
  • We prefer to number addresses in multiples of 5 (and 10 of course)
  • 13 is still taboo for numbering
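The observations above can be checked numerically by tallying house numbers and comparing each one with its immediate neighbors. The sketch below is one possible approach, not the code behind Figure E; the function names, the choice to ignore non-numeric values such as "12B", and the sample data are all assumptions.

```python
from collections import Counter


def house_number_counts(numbers, low=1, high=99):
    """Count purely numeric house numbers that fall within [low, high]."""
    counts = Counter()
    for raw in numbers:
        text = raw.strip()
        if text.isascii() and text.isdigit() and low <= int(text) <= high:
            counts[int(text)] += 1
    return counts


def neighbor_ratio(counts, n):
    """Frequency of n relative to the average of n-1 and n+1 (None if both absent)."""
    neighbors = (counts[n - 1] + counts[n + 1]) / 2
    return counts[n] / neighbors if neighbors else None


# Made-up sample values, not taken from the database.
sample = ["10", "10", "10", "9", "11", " 12 ", "13", "14", "12B", "100"]
counts = house_number_counts(sample)
```

On real data, a ratio well above 1 for multiples of 5 and below 1 for 13 would back up the bullets above. Passing `high=499` to `house_number_counts` extends the same tally to a wider range.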

    Let's extend this data through 499 and see if the trends continue.

    Yep. The trend does continue for the most part with the obvious addition that with every 100 numbers the frequency jumps up again. Why might this happen? Are communities starting house numbers at 300 and counting up or incrementing by 5? Hard to be certain from the data alone, but one hypothesis is that just as communities prefer to number in multiples of 5 and 10, they are even more preferential toward multiples of 100. This would mean there are a lot of extra zeros on envelopes shipped around the US. Let me know if you have other theories about why these patterns exist in our addresses.

    Links